BSTA 6100 Fall 2025 Lab 01
2025-09-16
CSV stands for “comma separated values” and is a commonly used file type for storing data. Open the file “penguins.csv” from the files pane (lower right) to see what a .csv file looks like:
Each row of the file is an “observation” or “case”, and consists of one or more variables whose values are separated by commas (hey, look at that). The first row contains the names of the variables contained in the file.
"species","island","bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g","sex","year"
"Adelie","Torgersen",39.1,18.7,181,3750,"male",2007
"Adelie","Torgersen",39.5,17.4,186,3800,"female",2007
We’re going to start by working with a data set describing 333 penguins observed on 3 islands in the Palmer Archipelago in Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network, and the data were prepared by Dr. Allison Horst.
We can read data into R using a function called read.csv(). The first argument to read.csv() is the name of a .csv file (here, penguins.csv), in quotes. We then store the results of read.csv() as an object called penguins.
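Here is a minimal sketch of the read.csv() step that runs without the lab's file. It writes three rows to a temporary file first; the first two rows are the ones shown above, and the Gentoo row is made up for illustration. The stringsAsFactors = TRUE argument is an assumption: the str() output later in this lab shows factor columns, which suggests it was used, but the original call is not shown.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere.
csv_lines <- c(
  '"species","island","body_mass_g"',
  '"Adelie","Torgersen",3750',
  '"Adelie","Torgersen",3800',
  '"Gentoo","Biscoe",5000'        # made-up row, not from the lab data
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# The file name goes in quotes; the result is stored with <-.
# stringsAsFactors = TRUE stores text columns as factors (an assumption here).
toy <- read.csv(tmp, stringsAsFactors = TRUE)

class(toy)  # "data.frame"
dim(toy)    # number of rows, then number of columns
```

In the lab itself, the file name would be "penguins.csv" and the object would be called penguins.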
The penguins object is called a data.frame.
Using head() to peek at a data.frame
Let’s see what’s in the data. We can peek at the first few (6, specifically) rows of the data using the head() function:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18.0 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
sex year
1 male 2007
2 female 2007
3 female 2007
4 female 2007
5 male 2007
6 female 2007
We read head(penguins) as “head of penguins”. Remember that penguins is what we named our data set. We can see that penguins contains a number of variables, like species, island, and more.
| Variable name | Description |
|---|---|
| species | Penguin species (Adélie, Chinstrap, and Gentoo) |
| island | Island in Palmer Archipelago, Antarctica, on which the penguin was observed (Biscoe, Dream, or Torgersen) |
| bill_length_mm | A number denoting bill length (in millimeters) |
| bill_depth_mm | A number denoting bill depth (in millimeters) |
| flipper_length_mm | A whole number denoting flipper length (in millimeters) |
| body_mass_g | A whole number denoting penguin body mass (in grams) |
| sex | Penguin sex (female, male) |
| year | Study year (2007, 2008, 2009) |
Using str() to peek at a data.frame
We can also peek at the data using a function called str() (pronounced “stir”, short for “structure”):
'data.frame': 333 obs. of 8 variables:
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
$ bill_depth_mm : num 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
$ flipper_length_mm: int 181 186 195 193 190 181 195 182 191 198 ...
$ body_mass_g : int 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
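The two peeking functions can be tried on any data.frame. The sketch below uses a small made-up data frame (the names toy, id, group, and measure are hypothetical) and also shows the n argument to head(), which changes how many rows are printed.

```r
# A made-up data frame with 10 rows and 3 variables
toy <- data.frame(
  id      = 1:10,
  group   = factor(rep(c("a", "b"), times = 5)),
  measure = seq(0.5, 5, by = 0.5)
)

head(toy, n = 3)  # first 3 rows instead of the default 6
tail(toy, n = 2)  # last 2 rows
str(toy)          # one line per variable: name, type, first few values
dim(toy)          # number of rows, then number of columns
```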
Let’s start with the species variable. Is this a categorical or quantitative variable? How do you know?
To make a frequency table of a categorical variable, we use the table() function:
So, there are 119 Gentoo penguins in the data.
Pass a table to prop.table() to get a table of proportions:
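As a rough illustration of both steps, the sketch below counts a made-up vector of six species labels and then converts the counts to proportions. The data are invented, so the counts do not match the lab's data.

```r
# Six made-up species labels
species <- factor(c("Adelie", "Adelie", "Chinstrap",
                    "Gentoo", "Gentoo", "Gentoo"))

counts <- table(species)      # frequency table: one count per category
counts

props <- prop.table(counts)   # each count divided by the total
round(props, 2)               # proportions sum to 1
```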
The $ Operator
Notice that we passed penguins$species to table(): we had to identify the data.frame that contains the variable species. The dollar sign ($) tells R to look inside the object on the left for the object on the right.
It’s very important that you tell R which data frame the variable you’re interested in is from. Let’s see what happens when we don’t:
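A sketch of what goes wrong, using a made-up data frame with a hypothetical column called penguin_species. The call is wrapped in tryCatch() only so the example keeps running after the error; typed directly in the console, R would stop and print the error message.

```r
toy <- data.frame(penguin_species = c("Adelie", "Gentoo", "Gentoo"))

# With the data frame named, R finds the column.
table(toy$penguin_species)

# Without it, R looks for a free-standing object called penguin_species,
# finds none, and signals an error. tryCatch() captures the error message.
msg <- tryCatch(
  table(penguin_species),
  error = function(e) conditionMessage(e)
)
msg
```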
Side note 1: If you were ever taught to use the attach() function to load a data.frame into the namespace, don’t do that!
Side note 2: When writing text in Quarto, if you want to use $, you must escape it with \. Write \$500 instead of $500.
We can also make “two-way” frequency tables (sometimes called “contingency tables”) to summarize counts for two categorical variables:
Biscoe Dream Torgersen
Adelie 44 55 47
Chinstrap 0 68 0
Gentoo 119 0 0
Data is Really Cool (R, then C: rows, then columns), so the first variable you give to table() is in the rows of the table, and the second is in the columns.
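The rows-then-columns rule can be checked on a small made-up data frame. The sketch below is illustrative only; its counts are not the lab's counts.

```r
# Six made-up penguins
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"),
  island  = c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe", "Biscoe")
)

# First argument -> rows, second argument -> columns
two_way <- table(toy$species, toy$island)
two_way

two_way["Gentoo", "Biscoe"]  # look up one cell by [row, column]
addmargins(two_way)          # append row and column totals
```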
Bar graphs / charts / plots can be used to visualize categorical data.
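A minimal bar chart sketch, assuming a made-up vector of species labels. It uses the table-then-barplot() pattern with the xlab, ylab, main, and col arguments; the colors are the same three used later in this lab.

```r
# Six made-up species labels
species <- c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo")
counts <- table(species)   # bar heights come from the frequency table

barplot(counts,
        main = "Penguins by Species (toy data)",
        xlab = "Species",
        ylab = "Number of penguins",
        col  = c("darkorange1", "mediumorchid2", "darkcyan"))
```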
- Pass a table object to barplot() – the function takes “heights” as input.
- xlab is the x-axis label (in quotes)
- ylab is the y-axis label (in quotes)
- main is the main title (in quotes)
- col is a vector of color names that are applied in order of the entries in the passed table.

Let’s start with the flipper_length_mm variable. Is this a categorical or quantitative variable? How do you know?
We can use R to summarize data numerically. We’ll use the summary() function to do that for a given variable. Here, we’ll summarize the flipper_length_mm variable, which is the length of the penguins’ flippers (in millimeters).
You can always get just the one numerical summary you’re looking for using the function for that specific summary:
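As a small sketch, the code below applies summary() and a few single-purpose functions to the six flipper lengths shown in the head() output earlier. With only six values, the results differ from those for the full data set.

```r
# First six flipper lengths (mm) from the head() output
flipper <- c(181, 186, 195, 193, 190, 181)

summary(flipper)         # min, Q1, median, mean, Q3, max in one call

mean(flipper)            # 1126 / 6
median(flipper)          # 188
sd(flipper)              # standard deviation
min(flipper)
max(flipper)
quantile(flipper, 0.25)  # first quartile
```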
Boxplots visualize the “5-number summary” (min, Q1, median, Q3, max) of a quantitative variable.
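A sketch of the connection between the five numbers and the picture, again using the first six flipper lengths from the head() output. fivenum() returns Tukey's five-number summary, whose quartile-like "hinges" can differ slightly from the quartiles that summary() reports.

```r
# First six flipper lengths (mm) from the head() output
flipper <- c(181, 186, 195, 193, 190, 181)

# min, lower hinge, median, upper hinge, max
five <- fivenum(flipper)
five

boxplot(flipper,
        horizontal = TRUE,
        main = "Flipper Length of Six Penguins",
        xlab = "Flipper length (mm)")
```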
Histograms can be used to visualize the distribution of a quantitative variable.
Notice the unprofessional title and x-axis label: hardly anybody other than you understands your variable naming syntax.
Always provide main, xlab, and ylab arguments as appropriate when making plots, unless you’re doing something fast that you won’t show anyone else.
Here’s something better:
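One possible version of a labeled histogram is sketched below. It uses the ten flipper lengths visible in the str() output rather than the full data set, and the title and axis labels are suggestions, not the lab's original wording.

```r
# First ten flipper lengths (mm) from the str() output
flipper <- c(181, 186, 195, 193, 190, 181, 195, 182, 191, 198)

h <- hist(flipper,
          main = "Distribution of Penguin Flipper Length",
          xlab = "Flipper length (mm)",
          ylab = "Number of penguins")

h$counts  # hist() also returns the bin counts it plotted
```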
Sometimes we want to only look at a certain section of our data. To do this, we’ll create a subset.
- Which data.frame do you want to subset?
- Which condition should the rows satisfy? (See ?Comparison in the R console for help.)
- Use a double equals sign (==)! This is logical equals, which is a comparison operator. = is an assignment operator, like <-.

Logical expressions (e.g., species == "Chinstrap") implicitly create TRUE/FALSE (“Boolean”) objects in R. The statement will be TRUE when an observation of the species variable is exactly “Chinstrap” (case-sensitive) and FALSE otherwise.
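A hedged sketch of these ideas, assuming the base R subset() function is the tool intended here (the lab's own chunk is not shown). The data frame toy and its values are made up.

```r
# Four made-up penguins
toy <- data.frame(
  species     = c("Adelie", "Chinstrap", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 3500, 5000, 3800)
)

# The logical expression on its own: one TRUE/FALSE per row
toy$species == "Chinstrap"

# subset() keeps the rows where the expression is TRUE
chinstrap <- subset(toy, species == "Chinstrap")
chinstrap

# Case matters: "chinstrap" matches nothing
nrow(subset(toy, species == "chinstrap"))
```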
TRY IT! Fill in the chunk below to create subsets for the other species of penguin.
R’s data.frames inherit properties of arrays, which have rows and columns. (Remember that arrays are Really Cool, so we always write rows, columns.)
We can select particular rows or columns by placing logical expressions inside square brackets []:
Since penguins is two-dimensional (like all data.frames), we must specify conditions for both rows and columns, separated by a comma. Leaving the space after the comma blank tells R to select all columns.
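The [rows, columns] pattern can be sketched on a made-up data frame as follows (toy and its values are invented for illustration).

```r
# Four made-up penguins
toy <- data.frame(
  species     = c("Adelie", "Chinstrap", "Gentoo", "Chinstrap"),
  island      = c("Torgersen", "Dream", "Biscoe", "Dream"),
  body_mass_g = c(3750, 3500, 5000, 3800)
)

# Rows where species is Chinstrap, all columns (blank after the comma)
chinstrap <- toy[toy$species == "Chinstrap", ]
chinstrap

# Same rows, but only two named columns
toy[toy$species == "Chinstrap", c("species", "body_mass_g")]

# All rows (blank before the comma), one column
toy[, "island"]
```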
Every variable in a data.frame is a vector: a 1-dimensional object. To subset it, we need only provide conditions on that single dimension.
Let’s subset body_mass_g by sex.
The sex variable in this data is either female or male (note the lowercase names!).
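A sketch of subsetting a single vector, using the sex and body mass values of the six penguins shown in the head() output. Because a vector has only one dimension, there is no comma inside the brackets.

```r
# Sex and body mass (g) of the six penguins shown by head()
toy <- data.frame(
  sex         = c("male", "female", "female", "female", "male", "female"),
  body_mass_g = c(3750, 3800, 3250, 3450, 3650, 3625)
)

# One dimension, so one condition and no comma
female_mass <- toy$body_mass_g[toy$sex == "female"]
male_mass   <- toy$body_mass_g[toy$sex == "male"]

mean(female_mass)  # 3531.25
mean(male_mass)    # 3700
```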
A scatterplot is a way to visualize relationships between two numeric variables. On the x-axis is typically the “explanatory” variable (denoted \(x\)), and on the y-axis is the “response” variable (denoted \(y\)). The data is paired (x,y), then each pair is plotted using an open circle.
The plot() function, when given two numeric variables, will create a scatterplot. The first argument to plot() is on the x axis; the second, on the y axis.
When describing a bivariable relationship in a scatterplot, focus on:
Notice that there might be some clustering happening. Let’s color the plot by species to see if that might explain what we’re seeing.
# index the color vector by species so each point gets its species' color
plot(penguins$flipper_length_mm, penguins$body_mass_g,
     main = "Scatterplot of Body Mass vs. Flipper Length",
     xlab = "Flipper Length (mm)",
     ylab = "Body Mass (g)",
     col = c("darkorange1", "mediumorchid2", "darkcyan")[penguins$species])
legend("topleft",
       legend = c("Adelie", "Chinstrap", "Gentoo"),
       col = c("darkorange1", "mediumorchid2", "darkcyan"),
       pch = 1)
NOTE: The information in the legend is not tied to the plot by default. You can make a nonsense legend if you want (you don’t want this). Make sure your legend matches your plot!
Use the pch (plotting character) argument to plot(). Set pch to the number corresponding to the point you want. The default is 1, an open circle.
Let’s change the pch argument so that each species has a different color and plotting character.
# index both the color and the plotting-character vectors by species
plot(penguins$flipper_length_mm, penguins$body_mass_g,
     main = "Scatterplot of Body Mass vs. Flipper Length",
     xlab = "Flipper Length (mm)",
     ylab = "Body Mass (g)",
     col = c("darkorange1", "mediumorchid2", "darkcyan")[penguins$species],
     pch = c(0, 1, 2)[penguins$species])
legend("topleft",
       legend = c("Adelie", "Chinstrap", "Gentoo"),
       col = c("darkorange1", "mediumorchid2", "darkcyan"),
       pch = c(0, 1, 2))
The primary function of a graphical display is to convey information. Everything that goes on your plot needs to have a purpose and must convey information.
Use color only to convey information, and don’t rely on it too much.
- The khroma package (“Paul Tol colors”).
- (… ggplot2).

More tips: https://nbisweden.github.io/Rcourse/files/rules_for_using_color.pdf
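If you forget the [penguins$species] selector on the col or pch arguments, bad things happen! R recycles the colors and symbols down the rows in order, so they no longer correspond to species.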
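The selector works because indexing a vector by a factor uses the factor's integer level codes. The sketch below shows this without drawing a plot; the four species labels are made up, and species_colors is a hypothetical name.

```r
species <- factor(c("Gentoo", "Adelie", "Chinstrap", "Adelie"))
species_colors <- c("darkorange1", "mediumorchid2", "darkcyan")

levels(species)      # alphabetical: "Adelie" "Chinstrap" "Gentoo"
as.integer(species)  # level codes: 3 1 2 1

# With the selector: one color per observation, matched to its species
with_selector <- species_colors[species]
with_selector

# Without the selector, plot() would simply recycle the colors in row order
without_selector <- rep_len(species_colors, length(species))
without_selector
```

In the second result, the first observation (a Gentoo) would be drawn in the Adelie color.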
The ~ Operator
In R, we can use ~ (tilde, found underneath the Esc key in the top left corner of a U.S. English keyboard) as an operator that can be read as “by” (or “versus”). This operator is useful for making several of the plots we have already discussed.
Let’s make side-by-side boxplots of the numeric variable body_mass_g by species.
boxplot(penguins$body_mass_g ~ penguins$species,
        main = "Side-by-Side Boxplots of Body Mass by Penguin Species",
        xlab = "Species",
        ylab = "Body Mass in Grams")
We could also look at only two species by passing multiple arguments:
boxplot(penguins$body_mass_g[penguins$species == "Adelie"],
        penguins$body_mass_g[penguins$species == "Chinstrap"],
        names = c("Adelie", "Chinstrap"),
        main = "Side-by-Side Boxplots of Body Mass by Penguin Species",
        xlab = "Species",
        ylab = "Body Mass in Grams")
Let’s go back to the scatterplot we made last week and update it to use the ~ operator. With the formula interface, we can also pass the data set to plot() through the data argument, which lets us skip the $.
plot(body_mass_g ~ bill_length_mm,
     data = penguins,
     main = "Scatterplot of Penguin Body Mass versus Bill Length",
     xlab = "Bill Length (mm)",
     ylab = "Body Mass (g)")
Notice the order here: the y variable (body_mass_g) is written first, then the tilde, then the x variable (bill_length_mm). This is because for scatterplots, the order is y by x or y ~ x. Be very careful setting up scatterplots!
R has a native “pipe” operator |> that passes the result of the left-hand-side expression to the right-hand-side expression as the first argument in the call.
x |> f(y) is interpreted as f(x, y)
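A short sketch of the pipe on a made-up numeric vector. The native pipe requires R 4.1 or later.

```r
x <- c(4, 9, 16)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: read left to right (take x, then sqrt, then sum)
piped <- x |> sqrt() |> sum()

nested  # 9
piped   # 9

# The left-hand side becomes the FIRST argument; others follow as usual
c(1.234, 5.678) |> round(1)   # same as round(c(1.234, 5.678), 1)
```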
Base R graphics rely on “graphical parameters” that are set either inside or outside calls to plotting functions. (See ?par for full details.)
To put two plots side by side, we can set the mfrow graphical parameter (mf might stand for “matrix figure”) before calling plot().
par(mfrow = c(1, 2)) # 1 row, 2 columns
plot(body_mass_g ~ flipper_length_mm, data = penguins,
     main = "Body Mass vs. Flipper Length",
     xlab = "Flipper length (mm)",
     ylab = "Body mass (g)")
plot(body_mass_g ~ bill_length_mm, data = penguins,
     main = "Body Mass vs. Bill Length",
     xlab = "Bill length (mm)",
     ylab = "Body mass (g)")
NOTE: When outside of an RMarkdown or Quarto document, you’ll sometimes need to reset the graphical parameters. Do this by calling dev.off() or by calling par() with the original parameters.
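One way to restore the original parameters is sketched below. When par() is used to set a parameter, it returns the previous values, so they can be saved and passed back later. The two plots use made-up numbers, and old_par is a hypothetical name.

```r
# Setting mfrow returns the old value; save it for later
old_par <- par(mfrow = c(1, 2))

plot(1:10, (1:10)^2,
     main = "Left panel", xlab = "x", ylab = "x squared")
plot(1:10, sqrt(1:10),
     main = "Right panel", xlab = "x", ylab = "Square root of x")

# Hand the saved values back to par() to undo the change
par(old_par)
par("mfrow")  # back to the layout that was in effect before
```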
If you have a long figure title, you can break it onto multiple lines with \n:
par(mfrow = c(1, 2))
# left panel: the long title on a single line gets cut off
plot(body_mass_g ~ flipper_length_mm, data = penguins,
     main = "This is a very very long figure title that gets cut off because it's too long",
     xlab = "Flipper length (mm)",
     ylab = "Body mass (g)")
# right panel: \n breaks the title onto two lines
plot(body_mass_g ~ flipper_length_mm, data = penguins,
     main = "This is a very very long figure title\nthat gets cut off because it's too long",
     xlab = "Flipper length (mm)",
     ylab = "Body mass (g)")
tapply()
The tapply() function applies a function to the values of a vector. The function can be a predefined R function like mean() or a user-defined function.
The power of tapply() is that it allows for a vector to be split into groups, with the function applied to each group.
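The function has the generic structure tapply(X, INDEX, FUN), where X is the vector to summarize, INDEX is the grouping variable, and FUN is the function to apply to each group.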
Using tapply()
# calculate mean flipper length for each penguin species
tapply(penguins$flipper_length_mm, penguins$species, mean)
   Adelie Chinstrap    Gentoo
 190.1027  195.8235  217.2353
# summarize flipper length for each penguin species
tapply(penguins$flipper_length_mm, penguins$species, summary)
$Adelie
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  172.0   186.0   190.0   190.1   195.0   210.0
$Chinstrap
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  178.0   191.0   196.0   195.8   201.0   212.0
$Gentoo
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  203.0   212.0   216.0   217.2   221.5   231.0
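tapply() also accepts a user-defined function. The sketch below uses a made-up data frame and a hypothetical helper, value_range(), that returns the spread between the largest and smallest value in each group.

```r
# Five made-up penguins
toy <- data.frame(
  species           = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  flipper_length_mm = c(181, 186, 210, 215, 220)
)

# User-defined function (hypothetical helper): max minus min
value_range <- function(x) max(x) - min(x)

# X = values to summarize, INDEX = grouping variable, FUN = function per group
ranges <- tapply(toy$flipper_length_mm, toy$species, value_range)
ranges  # Adelie 5, Gentoo 10
```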